Frequency-oriented hierarchical fusion network for single image raindrop removal

Single image raindrop removal aims at recovering high-resolution images from degraded ones. However, existing methods primarily employ pixel-level supervision between image pairs to learn spatial features, thus ignoring the more discriminative frequency information. This drawback results in the loss of high-frequency structures and the generation of diverse artifacts in the restored image. To ameliorate this deficiency, we propose a novel frequency-oriented Hierarchical Fusion Network (HFNet) for raindrop image restoration. Specifically, to compensate for spatial representation deficiencies, we design a dynamic adaptive frequency loss (DAFL), which allows the model to adaptively handle the high-frequency components that are difficult to recover. To handle spatially diverse raindrops, we propose a hierarchical fusion network to efficiently learn both contextual information and spatial features. Meanwhile, a calibrated attention mechanism is proposed to facilitate the transfer of valuable information. Comparative experiments with existing methods indicate the advantages of the proposed algorithm.


Introduction
Image restoration in severe weather has long been a research topic in low-level vision [1][2][3][4]. When an imaging device captures a scene in bad weather, distracting factors such as rain, snow, and fog can obscure structures and degrade the visual quality. In recent years, there has been increasing interest in single-image deraining, and significant advances have been made [5,6]. However, single image raindrop removal remains extremely challenging due to the diversity and uncertainty of raindrop morphology.
Previous studies of single-image raindrop removal fall into two trends: physical model-based and deep learning-based [7][8][9][10][11][12][13]. Traditional physical model-based approaches rely on prior information to establish mathematical models and decompose the raindrop image into raindrop maps and clean background images by sparse coding and dictionary learning [14][15][16]. Although the model-based approach can produce the expected results, it can only extract the shallow features of raindrop images [17,18]. Since deep semantic information cannot be well represented, such methods introduce artifacts or color distortion in the restored image. Conversely, deep learning-driven solutions have yielded more promising outcomes because of their ability to represent image content and semantic information through non-linear mapping. These methods accomplish restoration from rain image to rain-free image by introducing constraints related to the physical properties of raindrops. Specifically, Qian et al. [19] introduced a recurrent attention mechanism to locate the raindrop regions and used adversarial training to accomplish raindrop image restoration. Based on the positive and negative effects of raindrops on the background, Shao et al. [20] proposed an uncertainty mask to enhance the representation accuracy of raindrops. In addition, some progressive restoration models are tailored to facilitate the reconstruction of textures and structures.
However, previous raindrop removal efforts [19][20][21][22][23] usually narrow the gap between the real image and the restored image only in the spatial domain. Despite achieving satisfactory performance in some scenarios, the restored images still suffer from distortion. This is because these methods do not take into account the frequency-domain distance between images, which is also important prior knowledge. The frequency-domain representation has two advantages. (1) Global properties. According to the mathematical formulation of the Fourier transform [24], each frequency in the Fourier domain is the result of a summation over all pixels in the spatial domain. Therefore, the frequency-domain representation has global perceptual capability. (2) Discriminative features. Different frequencies can be clearly separated in the Fourier domain, which is not the case in the spatial domain.
Based on the aforementioned analysis, we propose a novel frequency-aware hierarchical fusion network (HFNet) for removing raindrops. To be specific, we propose a dynamic adaptive frequency loss (DAFL) to narrow the gap between the restored image and the real image in the frequency domain. This objective function allows the network to focus on hard frequencies with dynamic weights, and preserves a more complete image structure by applying frequency distance constraints at different flexible scales. Then, we design a hierarchical fusion network which boosts the restoration quality by fusing enriched features from different stages. The proposed framework consists of three stages, each of which first employs a recurrent processing module (RPM) to extract reliable raindrop location information. During the first two stages, an efficient UNet is adopted to obtain multi-scale complementary features. Subsequently, we devise the calibrated attention module (MCAB) to guide the delivery of valuable information. At the final stage, we design a cascade resolution module (CRM) to restore high-resolution images by exploring deep texture details.
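The global property described above can be checked numerically. The following is a minimal sketch on a random 8 × 8 toy image (the data, the perturbed pixel position and the tolerance are illustrative assumptions, not taken from the paper): changing one pixel in the spatial domain changes every coefficient of the 2-D discrete Fourier transform.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # toy grayscale image (hypothetical data)

# Perturb a single pixel in the spatial domain.
perturbed = image.copy()
perturbed[3, 5] += 1.0

# 2-D discrete Fourier transform of both images.
spec_a = np.fft.fft2(image)
spec_b = np.fft.fft2(perturbed)

# Each coefficient F(u, v) is a sum over all pixels, so the single-pixel
# change contributes a unit-magnitude term to every coefficient.
delta = np.abs(spec_b - spec_a)
changed = int(np.count_nonzero(delta > 1e-9))
print(changed, "of", delta.size, "coefficients changed")
```

Because the perturbation is a unit impulse, its transform has magnitude 1 at every frequency, which is why no coefficient is left untouched.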
The primary contributions are listed below:
• We propose a dynamic adaptive frequency loss (DAFL) to complement the shortcomings of spatial constraints, which effectively focuses on the learning of high-frequency information that is hard to generate.
• We design hierarchical feature fusion networks to learn enriched contextual information and spatial features. Meanwhile, a calibrated attention mechanism is proposed to facilitate the transfer of useful information in different stages.
• Experiments validate that our method has significant advantages over existing methods and is more robust to spatially diverse raindrop degradation.

Rain removal
The study of raindrop image restoration is gradually attracting the attention of researchers due to its challenge and significance. Initially, researchers explored video rain removal based on time-domain and sparse property analysis [25][26][27]. Compared with video rain removal, which can exploit the spatio-temporal correlation between frames [28,29], single-image rain removal is more challenging. This is because less prior information is available, especially for raindrop removal in more complex situations. Current research on single-image raindrop removal is mainly divided into traditional model-driven methods [14,16,30,31] and deep learning-based data-driven methods [19-22, 32, 33]. Single-image rain removal was first developed using a traditional model-based approach. These methods rely on the properties of the rain map and the background scene to construct and optimize the constraint function. For example, Kang et al. [30] introduce a morphological component analysis method to deal with raindrop removal by applying a bilateral filter. To further enhance rain removal efficiency, Gu et al. [31] propose a model that involves both convolutional properties and synthetic sparse representation to effectively extract the image texture layer and flexibly model different image structure types. Nevertheless, these methods struggle with images that have heavy raindrop density or similar semantic information between the raindrop map and the background image.
Deep learning is currently making dramatic advances in low-level vision tasks, including single image raindrop removal. Eigen et al. [32] train a convolutional neural network to map raindrop images to clean ones. However, this approach is not suitable for handling complex raindrop scenarios because of its simple network structure. Later, Qian et al. [19] propose an attention generative adversarial network (AttGAN), which introduces a recurrent attention module into the GAN to highlight the raindrop region. To better utilize the interaction between different blocks, Liu et al. [33] design a dual residual connection network (DuRN) in a modular fashion to facilitate reconstructing the structure and texture of an image. Shao et al. [20] explore the effect of raindrop uncertainty on the background and design a multi-scale attentional network to recover more complete details by utilizing multi-scale features. Unfortunately, these methods primarily explore spatial features and neglect the more discriminative frequency information, resulting in artifacts in the recovered image.

Multi-stage hierarchical learning
In contrast to single-stage deraining methods, multi-stage learning achieves better results by increasing the depth of the network and focusing on different features at each stage [34][35][36][37]. However, we note that the upsampling and downsampling operations used for feature extraction cause essential image features to be lost. At the same time, the increased complexity of the network makes the contribution of each module unclear. To ameliorate these limitations, Ren et al. [21] devise a progressive recurrent network, which provides a better and simpler deraining network. Along this direction, Jiang et al. [22] analyze the complementary properties of rain patterns at different scales and design a multi-scale fusion network to promote detail rendering in recovered images.
Different from existing raindrop removal models in [19-22, 32, 33], we propose a frequency-level constraint to ensure that the restored image and the clear image are as consistent as possible in the frequency domain. Besides, the designed hierarchical feature fusion network makes full use of multi-stage features, dramatically reducing the loss of valuable information by incorporating intermediate features from the various stages. The experiments reported in the Experiments section show the superiority of our method.

Proposed method
In this section, we propose a frequency-guided hierarchical fusion network (HFNet) to remove raindrops by utilizing frequency representation and cross-stage progressive feature fusion. The overview of the designed HFNet is displayed in Fig 2. We detail the implementation of the individual modules in the following parts.

Hierarchical fusion network
As depicted in Fig 2, our HFNet takes the form of a three-stage architecture. In each stage, we extract useful multi-scale features and accurate spatial details through the cooperation of several modules. In addition, a variety of attention modules are introduced to highlight the critical regions.
Recurrent processing module. Fig 3(a) shows the structure of the RPM. It is mainly composed of a convolution layer, a Long Short-Term Memory (LSTM) unit, and Res-Blocks [38,39]. The input of the LSTM comes from both the output of the preceding convolution layer and the LSTM in the previous iteration. Meanwhile, we adopt five residual blocks in Res-Blocks, each containing two convolution layers and a ReLU activation function. RPM gradually extracts the deep representation of raindrop images through recurrent iteration. Each iteration uses the original raindrop image and the output of RPM in the previous iteration as the input, which can be expressed as:
$$X_t = f_{RPM}(X, X_{t-1}),$$
where $f_{RPM}$ denotes the RPM recurrent operation, and $X$, $X_{t-1}$ and $X_t$ denote the original image, the output features of the previous stage, and the output features of the current stage, respectively.
Encoder-decoder with joint attention block. Different from previous designs [40,41], the adopted encoder-decoder incorporates a JAM module at the skip connection. JAM includes a dual-pooling channel attention block (DCAB) and a dual-pooling spatial attention block (DSAB) in series, as well as a residual structure, as shown in Fig 3(b). This strategy not only passes the original features at different scales to the decoder, but also amplifies the essential features and suppresses the effect of potentially disturbing features. Here our encoder adopts double convolution and max-pooling to complete multiple downsampling operations.
In MCAB, the input comes from the features $F_{de}$ extracted by the decoder and the original image $I_{or}$. The channel number of $F_{de}$ is adjusted from $C$ to 3 through a convolution operation, which yields the residual image $I_{re}$. Then $I_{re}$ is subtracted from $I_{or}$ to obtain the restored image $I$:
$$I = I_{or} - I_{re}.$$
We also perform a convolution and a sigmoid operation on $I$ to generate the multi-channel attention map $A_{mc}$. The map is used to mark the usefulness of the information in $F_{de}$ through an element-wise product, and the result is combined with $F_{de}$ to form the output feature of MCAB. In this formulation, $f_c$ denotes the 1 × 1 convolution operation and $\otimes$ denotes the element-wise product.
Cascade resolution module. In the final stage, we design a CRM module to acquire high-resolution images by extracting degradation details at various hierarchical levels. As shown at the bottom of Fig 2, the module consists of four resolution blocks $R_i$, each of which is implemented by multiple DCABs. The first three $R_i$ blocks receive features from the encoder-decoder and the output of the previous block, respectively. Afterwards, the feature scales are integrated by an up-sampling operation. The blocks are assembled in a cascading fashion, where $x_i$ denotes the input of $R_{i+1}$, $i \in [0, 3]$, and $F^{en}_i$ and $F^{de}_i$ denote the features output by the encoder and decoder, respectively. Also, $f_{up\times 2}$ and $f_{up\times 4}$ denote enlarging the feature scale by factors of 2 and 4 through upsampling.
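The MCAB data flow described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the helper names (`conv1x1`, `mcab`), the random weights, and in particular the final combination `f_de + f_de * a_mc` are assumptions, since the text only states that the attention-weighted result is combined with $F_{de}$.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: mixes channels per pixel. x: (C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mcab(f_de, i_or, w_res, w_att):
    """Hypothetical MCAB forward pass following the textual description."""
    i_re = conv1x1(f_de, w_res)                # residual image: C -> 3 channels
    restored = i_or - i_re                     # I = I_or - I_re
    a_mc = sigmoid(conv1x1(restored, w_att))   # attention map: 3 -> C channels
    out = f_de + f_de * a_mc                   # ASSUMPTION: residual combination with F_de
    return restored, out

rng = np.random.default_rng(0)
C, H, W = 4, 5, 5
f_de = rng.standard_normal((C, H, W))          # decoder features (toy data)
i_or = rng.random((3, H, W))                   # original raindrop image (toy data)
restored, out = mcab(f_de, i_or,
                     rng.standard_normal((3, C)), rng.standard_normal((C, 3)))
print(restored.shape, out.shape)
```

The restored image keeps the 3-channel shape of the input image, while the output feature keeps the channel count of $F_{de}$, so it can be passed on to the next stage.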

Objective function
In order to adaptively focus on difficult-to-synthesize high-frequency components, we design an adaptive frequency loss with multiple flexible scale factors:
$$L^{j}_{adap} = \sum_{u}\sum_{v=0}^{N-1} w_j(u,v)\,\left|I(u,v) - G(u,v)\right|^2, \quad (8)$$
where $(u, v)$ denotes the coordinates of the spatial frequencies, and $I(u, v)$ and $G(u, v)$ denote the restored image and the ground-truth image at $(u, v)$, respectively. $w_j(u, v)$ $(j = 1, 2, 3)$ is the weight of the spatial frequency under the $j$-th flexible scale factor. It is governed by the trade-off coefficients $\mu_1$, $\mu_2$ and $\mu_3$, set to 0.5, 0.3 and 0.2, respectively, and by the flexible scale factors $\alpha_0$, $\alpha_1$ and $\alpha_2$, set to 1, 1/2 and 1/3, respectively. The total adaptive frequency loss is therefore:
$$L_{adap} = \sum_{j=1}^{J} L^{j}_{adap},$$
where $J$ denotes the number of flexible scale factors. When the frequency content at a specific coordinate belongs to the hard frequencies, the function adaptively allocates a larger weight to attract the attention of the network, while the weight is reduced for easy frequencies. Besides, at each stage $S$ we optimize HFNet with a loss that combines the Charbonnier loss $L_{char}$, the edge loss $L_{edge}$ and the adaptive frequency loss $L_{adap}$, where the weights $\lambda_1$ and $\lambda_2$ are set to 0.05 and 1, respectively [42]. $L_{char}$ denotes the Charbonnier loss [43], which approximates the pixel-level distance between the recovered and clean images:
$$L_{char} = \sqrt{\|I - G\|^2 + \varepsilon^2},$$
where $I$ and $G$ denote the restored image and the ground-truth image, respectively, and the constant $\varepsilon$ is empirically set to $10^{-3}$. Besides, we adopt the edge loss $L_{edge}$ to constrain the edge texture consistency between the restored image and the ground-truth image:
$$L_{edge} = \sqrt{\|\Delta(I) - \Delta(G)\|^2 + \varepsilon^2},$$
where $\Delta$ denotes the Laplacian operator [44].

Algorithm 1. Training steps of the proposed method (recoverable steps).
  Multi-scale features (encoder-decoder);
  Multi-channel attention (MCAB);
  Cross-stage feature fusion;
  Raindrop removal: $I = f_{CRM}(I_{or}, F^{S}_{i-1})$;
  Charbonnier loss: $L_{char}$;
  Edge loss: $L_{edge}$;
  Adaptive frequency loss: $L_{adap} = \sum_{j=1}^{J} L^{j}_{adap}$;
  Total loss: $L$;
  Optimize the network with $L$;
  end for
  return $M_{trained}$.
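To make the three loss terms concrete, the following NumPy sketch evaluates them on single-channel images. Several parts are assumptions and not taken from the paper: the 5-point Laplacian stencil with replicated borders, and especially the weight form `mu * |I - G| ** alpha`, which is borrowed from focal-frequency-style losses because the weight equation is not recoverable here. The constants 0.5/0.3/0.2, 1, 1/2, 1/3 and $\varepsilon = 10^{-3}$ are the values quoted in the text.

```python
import numpy as np

EPS = 1e-3                      # Charbonnier constant quoted in the text
MU = (0.5, 0.3, 0.2)            # trade-off coefficients quoted in the text
ALPHA = (1.0, 0.5, 1.0 / 3.0)   # flexible scale factors quoted in the text

def charbonnier(restored, target, eps=EPS):
    """sqrt(||I - G||^2 + eps^2)."""
    return float(np.sqrt(np.sum((restored - target) ** 2) + eps ** 2))

def laplacian(img):
    """5-point Laplacian with edge-replicated borders (assumed stencil); img: (H, W)."""
    p = np.pad(img, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * img

def edge_loss(restored, target, eps=EPS):
    """Charbonnier distance between the Laplacian responses of both images."""
    return charbonnier(laplacian(restored), laplacian(target), eps)

def adaptive_frequency_loss(restored, target):
    """Sum over scales j of sum_{u,v} w_j(u,v) |I(u,v) - G(u,v)|^2 in the Fourier domain."""
    diff = np.abs(np.fft.fft2(restored) - np.fft.fft2(target))
    total = 0.0
    for mu, alpha in zip(MU, ALPHA):
        weight = mu * diff ** alpha          # ASSUMED weight: larger for hard frequencies
        total += float(np.sum(weight * diff ** 2))
    return total

rng = np.random.default_rng(0)
clean = rng.random((16, 16))
noisy = clean + 0.1 * rng.standard_normal((16, 16))
print(charbonnier(noisy, clean), edge_loss(noisy, clean),
      adaptive_frequency_loss(noisy, clean))
```

In a training framework the weight would normally be treated as a constant when differentiating, so that it only rescales the squared frequency distance; this sketch evaluates the loss values only.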

Experiments
We evaluate the effectiveness of the proposed method on the PSNR [45], SSIM [46], LPIPS [47], and FID [48] metrics. Implementation details, comparison experiments and ablation studies are provided below.
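For reference, PSNR is computed from the mean squared error between the restored image and the ground truth. The sketch below uses the standard definition for 8-bit images; the function name and the toy arrays are illustrative assumptions.

```python
import numpy as np

def psnr(restored, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    diff = restored.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

target = np.zeros((4, 4), dtype=np.uint8)
restored = np.ones((4, 4), dtype=np.uint8)   # every pixel off by one gray level
print(round(psnr(restored, target), 2))      # about 48.13 dB
```

Because the scale is logarithmic, a PSNR gain of about 1 dB on the same image corresponds to roughly a 20% reduction in mean squared error.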

Implementation details
We utilize an NVIDIA RTX 2080Ti GPU to train our HFNet. The learning rate is set within the range $[2 \times 10^{-6}, 1 \times 10^{-4}]$. We use the Adam optimizer with $\beta_1$ and $\beta_2$ set to 0.9 and 0.999, respectively. The input images are cropped to 256 × 256 with a batch size of 4. Training runs for 1000 epochs. Algorithm 1 gives the specific steps for training HFNet.
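The settings above can be sketched as two small helpers. Both are assumptions for illustration: the text gives only the learning-rate range, so the cosine decay from $1 \times 10^{-4}$ to $2 \times 10^{-6}$ is a guess at the schedule, and `paired_random_crop` is a hypothetical helper showing how a raindrop image and its ground truth can be cropped at the same 256 × 256 location.

```python
import math
import numpy as np

LR_MAX, LR_MIN, EPOCHS = 1e-4, 2e-6, 1000   # values quoted in the text

def cosine_lr(epoch, lr_max=LR_MAX, lr_min=LR_MIN, total=EPOCHS):
    """ASSUMED schedule: cosine decay from lr_max (first epoch) to lr_min (last epoch)."""
    t = epoch / (total - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

def paired_random_crop(rainy, clean, size=256, rng=None):
    """Crop the same size x size window from both images; arrays are (H, W, C)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = rainy.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    window = (slice(top, top + size), slice(left, left + size))
    return rainy[window], clean[window]

rng = np.random.default_rng(0)
rainy = rng.random((480, 720, 3))            # toy image pair (hypothetical sizes)
clean = rainy + 1.0
crop_rainy, crop_clean = paired_random_crop(rainy, clean, rng=rng)
print(crop_rainy.shape, cosine_lr(0), cosine_lr(EPOCHS - 1))
```

Cropping both images with one shared window keeps the pixel-level supervision aligned, which the spatial and frequency losses both rely on.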

Dataset
We use two types of data. The first was gathered by Qian et al. [19] and includes a training set of 861 samples and two test sets, Test_a and Test_b, with 58 and 249 samples, respectively. This dataset contains raindrop images with different scenes, densities, and transparencies. The second consists of images we captured in real outdoor raindrop scenes to verify the generalizability of the model.

Comparative experiments
Quantitative comparison. Table 1 compares the raindrop removal results of our method and existing methods on different metrics. Our method shows a considerable improvement over DSC [14], pix2pix [49], and DDN [49], and it also outperforms AttGAN [19] under the same conditions. Although PReNet [21] and MSPFN [22] utilize a cross-stage learning approach, they do not consider the multi-channel complementary features or the frequency-domain distance between images. As shown in Table 1, our method improves PSNR by about 1.06 dB and 1.11 dB on Test_a compared to MSPFN and UMAN, respectively, and by 0.8 dB and 2.01 dB on Test_b. Additionally, we note that the proposed method is inferior to AttGAN in LPIPS on Test_a. This is mainly because Test_a has fewer samples, so the network tends to overfit. On Test_b, which has more samples, our method performs best on all metrics except FID, where it is similar to UMAN. These results also indicate the strength of our method in generalization.
Qualitative comparison. Fig 5 shows the visual results of various deraining methods for raindrops with different shapes, densities, and transparencies. All methods achieve satisfactory removal for images with low raindrop density, but our approach looks more authentic in both global and local perception. Since the shape and transparency of raindrops become more diverse with increasing density, the compared methods produce slight artifacts when processing images with denser raindrops, whereas the visual quality of our method does not degrade significantly. The primary reason is that our soft mask learns more accurate raindrop shape and transparency during training. When removing small and highly transparent raindrops, AttGAN and MSPFN leave a noticeable amount of unremoved raindrops, and PReNet exhibits larger artifacts and distortion. In contrast, the images restored by our strategy have sharper textures.
Fig 6 demonstrates the advantages of the proposed method in processing images in which the color and content of raindrops are changed by background targets. As can be seen from the figure, although the compared methods can remove most of the raindrops, none of them recovers the regions with varying color or changing content well. Conversely, our approach has a clear advantage in removing raindrops with such changes.
Results on natural images. We further verify the generalization of our method on natural raindrop images. As shown in Fig 7, our method exhibits fewer artifacts for raindrops with different shapes, colors, and transparency. This illustrates the better generalization of our proposed method. We also find that for some rainy regions that resemble rain streaks, our method accomplishes the restoration with high-quality image details, which indicates that our method is also applicable to the removal of rain streaks.
From Table 2, a slight degradation in the performance of the network can be seen after removing DCAB and DSAB from HFNet. Since RPM provides reliable image features and JAM focuses on important regions in terms of channel and spatial dimensions, the effect on the raindrop removal results is greater after removing RPM and JAM. Meanwhile, there is a significant performance reduction after removing MCAB, CRM and CSC. This is primarily due to the lack of complementary feature details across channels and across stages. The PSNR under these settings is reduced by about 0.61 dB, 0.93 dB, and 1.04 dB on Test_a, respectively. This result reveals the vital contribution of these three modules to the whole network. The visual comparison results are shown in Figs 8 and 9.
As shown in Table 3, the effectiveness of raindrop removal presents an upward trend as $N_s$ increases. The network achieves the best performance when $N_s$ is 5. However, the difference between the raindrop removal results with $N_s$ set to 4 and 5 is not significant, while setting $N_s$ to 5 increases the complexity. Considering the computational efficiency of the network, we set $N_s$ of HFNet to 4.
Effectiveness of loss function. We further perform ablation studies on the loss function in Table 4. The performance of the model gradually improves with the addition of different losses. The experimental results show that the adaptive frequency loss with multiple flexible scale factors can improve the authenticity of the restored image. The reason is that the constraints at different scales have complementary effects. We also visualize the output of each stage in Fig 10. The visual quality of the restored image gradually improves as the number of stages increases.

Conclusion
In this paper, we propose a novel frequency-aware Hierarchical Fusion Network (HFNet) for raindrop image restoration. The exploration of frequency representation allows the network to extract rich and accurate raindrop features. Meanwhile, we utilize the cooperation of hierarchical fusion and the calibrated attention mechanism to preserve a more complete background structure. We verify the superiority of the proposed method on real raindrop images and its generalization in natural scenes. The experimental results indicate that our strategy is capable of reconstructing more convincing images in the presence of raindrops with diverse appearances and background-dependent changes. In future work, we will investigate the impact of external factors (atmospheric light, etc.) on raindrop removal.
(2) Discriminative features. In contrast to representing an image in the spatial domain, different frequencies can be clearly separated in the Fourier domain. As shown in Fig 1, when a single frequency or a region of frequencies is missing from the spectrum, the entire generated image is altered and exhibits corresponding artifacts. Therefore, reconstructing these missing frequencies can preserve the structural integrity of the background. Motivated by the advantages of the frequency-domain representation, we intend to exploit a frequency distance constraint to compensate for the deficiencies of the spatial representation. Our DAFL facilitates the recovery of clear images with more details.
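The effect of a missing frequency can be illustrated with a small NumPy experiment. This is a toy sketch, not the procedure behind Fig 1: the 8 × 8 random image, the chosen frequency index and the tolerance are all illustrative assumptions. One frequency and its conjugate-symmetric partner are zeroed so that the reconstruction stays real-valued.

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((8, 8))          # toy grayscale image (hypothetical data)
spectrum = np.fft.fft2(image)

# Remove one frequency and its conjugate-symmetric partner.
u, v = 1, 2
damaged = spectrum.copy()
damaged[u, v] = 0.0
damaged[-u, -v] = 0.0

# Back to the spatial domain: the missing frequency shows up as a
# wave-like pattern spread over the whole image, not a local defect.
recon_complex = np.fft.ifft2(damaged)
recon = recon_complex.real
changed_pixels = int(np.count_nonzero(np.abs(recon - image) > 1e-12))
print(changed_pixels, "of", image.size, "pixels changed")
```

A single missing coefficient therefore alters pixels across the entire image, which is the behaviour the frequency distance constraint is meant to penalize.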

Fig 1. https://doi.org/10.1371/journal.pone.0301439.g001

The encoder adopts double convolution and max-pooling to complete multiple downsampling operations, while the decoder adopts bilinear upsampling and double convolution to double the size of the feature map and halve the number of channels, respectively. As depicted in Fig 2, we also design cross-stage fusion connections between the encoder-decoders of the first two stages to help enrich the network with raindrop features.
Multi-channel attention module. Instead of using a single-channel mask to predict the clean image, we first employ a multi-channel soft mask to assist in the restoration process. Next, we subtract the residual image from the original raindrop image to acquire the clean image under the constraint of the ground-truth image. As displayed in Fig 4, our MCAB block outputs the restored rain-free image at the current stage. At the same time, it generates a multi-channel attention map to highlight important information, which is fused with the features of the next stage. We introduce MCAB in the first two stages, which considerably enhances the texture profile of the recovered images.